 marginal effect




Towards a Comprehensive Scaling Law of Mixture-of-Experts

arXiv.org Artificial Intelligence

Mixture-of-Experts (MoE) models have become the consensus approach for enabling parameter-efficient scaling and cost-effective deployment in large language models. However, existing scaling laws for dense models are inapplicable to MoE models because of three critical challenges: the multiplicity of influencing factors, their intricate coupling relationships, and the non-monotonic nature of their performance impacts. Specifically, we design 446 controlled experiments to characterize their marginal effects, ultimately constructing a comprehensive and precise joint MoE scaling law that considers all essential factors. Our results demonstrate that the optimal settings for G and S are independent of both the model architecture and data size. Our proposed MoE scaling law can serve as accurate and insightful guidance for future MoE model design and training. Large language models (LLMs) have been widely verified and utilized in our daily lives. Remarkably, LLMs continue to expand the boundaries of their abilities as model and training data sizes increase. The scaling laws of LLMs (Kaplan et al., 2020; Hoffmann et al., 2022; Sun et al., 2025), which can predict the model loss from crucial factors (e.g., data and model sizes) before training, shed light on how to select appropriate model structures and settings before running experiments and how to keep enhancing the ability of LLMs under a given training budget or environment constraints. Recently, Mixture-of-Experts (MoE) has become one of the mainstream structures broadly used in powerful industry-level LLMs (Dubey et al., 2024; Liu et al., 2024; Sun et al., 2024; Liu et al., 2025; Qwen Team et al., 2025; OpenAI et al., 2025).


When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models

arXiv.org Artificial Intelligence

Metaphor, discussing one concept in terms of another, is abundant in politics and can shape how people understand important issues. We develop a computational approach to measure metaphorical language, focusing on immigration discourse on social media. Grounded in qualitative social science research, we identify seven concepts evoked in immigration discourse (e.g. "water" or "vermin"). We propose and evaluate a novel technique that leverages both word-level and document-level signals to measure metaphor with respect to these concepts. We then study the relationship between metaphor, political ideology, and user engagement in 400K US tweets about immigration. While conservatives tend to use dehumanizing metaphors more than liberals, this effect varies widely across concepts. Moreover, creature-related metaphor is associated with more retweets, especially for liberal authors. Our work highlights the potential for computational methods to complement qualitative approaches in understanding subtle and implicit language in political discourse.


Machine Unlearning via Information Theoretic Regularization

arXiv.org Machine Learning

How can we effectively remove or "unlearn" undesirable information, such as specific features or individual data points, from a learning outcome while minimizing utility loss and ensuring rigorous guarantees? We introduce a mathematical framework based on information-theoretic regularization to address both feature and data point unlearning. For feature unlearning, we derive a unified solution that simultaneously optimizes diverse learning objectives, including entropy, conditional entropy, KL-divergence, and the energy of conditional probability. For data point unlearning, we first propose a novel definition that serves as a practical condition for unlearning via retraining, is easy to verify, and aligns with the principles of differential privacy from an inference perspective. Then, we provide provable guarantees for our framework on data point unlearning. By combining flexibility in learning objectives with simplicity in regularization design, our approach is highly adaptable and practical for a wide range of machine learning and AI applications.


Causal Inference Tools for a Better Evaluation of Machine Learning

arXiv.org Artificial Intelligence

We present a comprehensive framework for applying rigorous statistical techniques from econometrics to analyze and improve machine learning systems. We introduce key statistical methods such as Ordinary Least Squares (OLS) regression, Analysis of Variance (ANOVA), and logistic regression, explaining their theoretical foundations and practical applications in machine learning evaluation. The document serves as a guide for researchers and practitioners, detailing how these techniques can provide deeper insights into model behavior, performance, and fairness. We cover the mathematical principles behind each method, discuss their assumptions and limitations, and provide step-by-step instructions for their implementation. The paper also addresses how to interpret results, emphasizing the importance of statistical significance and effect size. Through illustrative examples, we demonstrate how these tools can reveal subtle patterns and interactions in machine learning models that are not apparent from traditional evaluation metrics. By connecting the fields of econometrics and machine learning, this work aims to equip readers with powerful analytical tools for more rigorous and comprehensive evaluation of AI systems. The framework presented here contributes to developing more robust, interpretable, and fair machine learning technologies.


Attribution Methods in Asset Pricing: Do They Account for Risk?

arXiv.org Artificial Intelligence

Over the past few decades, machine learning models have been extremely successful. As a result of axiomatic attribution methods, feature contributions have been explained more clearly and rigorously. There are, however, few studies that have examined domain knowledge in conjunction with the axioms. In this study, we examine asset pricing in finance, a field closely related to risk management. Consequently, when applying machine learning models, we must ensure that the attribution methods reflect the underlying risks accurately. In this work, we present and study several axioms derived from asset pricing domain knowledge. It is shown that while Shapley value and Integrated Gradients preserve most axioms, neither can satisfy all axioms. Using extensive analytical and empirical examples, we demonstrate how attribution methods can reflect risks and when they should not be used.


Policy design in experiments with unknown interference

arXiv.org Artificial Intelligence

This paper studies experimental designs for estimation and inference on policies with spillover effects. Units are organized into a finite number of large clusters and interact in unknown ways within each cluster. First, we introduce a single-wave experiment that, by varying the randomization across cluster pairs, estimates the marginal effect of a change in treatment probabilities, taking spillover effects into account. Using the marginal effect, we propose a test for policy optimality. Second, we design a multiple-wave experiment to estimate welfare-maximizing treatment rules. We provide strong theoretical guarantees and an implementation in a large-scale field experiment.


Which linguistic cues make people fall for fake news? A comparison of cognitive and affective processing

arXiv.org Artificial Intelligence

Fake news on social media has large, negative implications for society. However, little is known about what linguistic cues make people fall for fake news and, hence, how to design effective countermeasures for social media. In this study, we seek to understand which linguistic cues make people fall for fake news. Linguistic cues (e.g., adverbs, personal pronouns, positive emotion words, negative emotion words) are important characteristics of any text and also affect how people process real vs. fake news. Specifically, we compare the role of linguistic cues across both cognitive processing (related to careful thinking) and affective processing (related to unconscious automatic evaluations). To this end, we performed a within-subject experiment where we collected neurophysiological measurements of 42 subjects while they read a sample of 40 real and fake news articles. During our experiment, we measured cognitive processing through eye fixations, and affective processing in situ through heart rate variability. We find that users engage more in cognitive processing for longer fake news articles, while affective processing is more pronounced for fake news written in analytic words. To the best of our knowledge, this is the first work studying the role of linguistic cues in fake news processing. Altogether, our findings have important implications for designing online platforms that encourage users to engage in careful thinking and thus prevent them from falling for fake news.


fmeffects: An R Package for Forward Marginal Effects

arXiv.org Machine Learning

Forward marginal effects (FMEs) (Scholbeck et al., 2022) provide simple yet accurate local model-agnostic explanations in terms of forward differences in prediction. They address questions of the form: If we change x by an amount h, what is the change in predicted outcome ŷ? For instance, given a medical study where a model is trained to predict a patient's disease risk, FMEs can tell us each patient's individual change in predicted risk due to losing 5 kg in body weight. FMEs thus provide actionable and comprehensible advice for stakeholders, including ones without expertise in machine learning. If the change in predicted risk is substantial enough, doctors may recommend a tailored exercise and nutrition regimen.
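
The forward difference described above can be illustrated with a minimal Python sketch. This is not the fmeffects R package or its API. The function forward_marginal_effect, the toy model toy_risk, its coefficients, and the patient values are all hypothetical, chosen only to show the idea that the FME for one observation is the prediction at x shifted by h minus the prediction at x.

```python
import math
from typing import Callable, List, Sequence


def forward_marginal_effect(
    predict: Callable[[Sequence[float]], float],
    x: Sequence[float],
    feature: int,
    h: float,
) -> float:
    """Forward difference in prediction: predict(x with feature + h) - predict(x)."""
    x_shifted: List[float] = list(x)
    x_shifted[feature] += h
    return predict(x_shifted) - predict(x)


def toy_risk(x: Sequence[float]) -> float:
    """Hypothetical disease-risk model (illustrative coefficients only)."""
    weight_kg, age = x
    z = 0.05 * (weight_kg - 80.0) + 0.02 * (age - 50.0)
    return 1.0 / (1.0 + math.exp(-z))  # logistic link keeps risk in (0, 1)


# Hypothetical patients given as [weight in kg, age in years].
patients = [[90.0, 50.0], [70.0, 30.0]]

# Individual change in predicted risk for losing 5 kg (h = -5 on feature 0).
fmes = [forward_marginal_effect(toy_risk, p, feature=0, h=-5.0) for p in patients]

for patient, fme in zip(patients, fmes):
    print(f"patient {patient}: change in predicted risk = {fme:.4f}")
```

Because the toy model is nonlinear, the same 5 kg change yields a different effect for each patient, which is why the effect is reported per observation rather than as a single coefficient.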